This assignment is focused on exploring categorical-to-continuous variable relationships and continuous-to-continuous variable relationships. It is not open ended like the last two assignments. There are certain tasks you must complete for all problems, but you will gain experience with the different plot types introduced in the Week 07 recordings. You will practice creating, modifying, interpreting, and communicating insights from them. The last question requires you to visually explore relationships associated with one of the final projects of your choosing.
You must download the 3 data sets provided in the Canvas assignment page and save them to the appropriate directory on your computer.
Shiyi Wang
For each of the 3 assigned data sets you must perform the following ESSENTIAL activities:
You do NOT need to display basic descriptive statistics and counts. You will visually explore the variables in each problem.
You will work with the NumPy, Pandas, matplotlib.pyplot, and Seaborn modules in this assignment.
Import NumPy, Pandas, matplotlib.pyplot, and Seaborn using their commonly accepted aliases.
###
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df01 = pd.read_csv('hw07_prob_01.csv')
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df01.shape)
df01.info()
(2800, 2) <class 'pandas.core.frame.DataFrame'> RangeIndex: 2800 entries, 0 to 2799 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x 2800 non-null object 1 value 2800 non-null float64 dtypes: float64(1), object(1) memory usage: 43.9+ KB
### 1.2 Check for missing values
df01.isnull().sum()
x 0 value 0 dtype: int64
### 1.3 Check for unique values
df01.nunique()
x 4 value 2800 dtype: int64
### 1.4 Describe all columns
df01.describe(include='all')
| | x | value |
|---|---|---|
| count | 2800 | 2800.000000 |
| unique | 4 | NaN |
| top | A | NaN |
| freq | 700 | NaN |
| mean | NaN | 3.602424 |
| std | NaN | 3.092654 |
| min | NaN | -5.421891 |
| 25% | NaN | 1.301328 |
| 50% | NaN | 3.040259 |
| 75% | NaN | 5.619022 |
| max | NaN | 20.348324 |
### 2.1 Bar chart for non-numeric columns
sns.catplot(data = df01, x = 'x', kind='count')
<seaborn.axisgrid.FacetGrid at 0x24ab8404670>
They are balanced.
### 2.2 Histogram for numeric column marginal distribution
sns.displot(data=df01, x='value', kind='hist')
<seaborn.axisgrid.FacetGrid at 0x24abed7cc40>
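It is not symmetric. The distribution is right-skewed: the mean (3.60) exceeds the median (3.04), and the upper tail extends to about 20.3.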
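A quick numeric cross-check of asymmetry can accompany the histogram. The sketch below uses a synthetic gamma sample as a stand-in (an assumption, not the homework data) and compares the mean with the median and reports the sample skewness; a mean above the median together with positive skewness suggests a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (an assumption; this is NOT the homework data).
rng = np.random.default_rng(0)
values = pd.Series(rng.gamma(shape=2.0, scale=2.0, size=2800), name="value")

mean_value = values.mean()
median_value = values.median()
sample_skew = values.skew()  # positive when the right tail is longer

print(f"mean={mean_value:.3f}, median={median_value:.3f}, skew={sample_skew:.3f}")
```

With the real data, the same three calls could be applied to `df01['value']`.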
You will now explore the categorical-to-continuous relationship between the non-numeric column and numeric column in df01.
Create a BOX PLOT using Seaborn to visualize the summary statistics of the numeric column GIVEN the non-numeric column.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column
sns.catplot(data=df01, x='x', y='value', kind='box',
showmeans=True,
meanprops={'marker':'o','markerfacecolor':'white','markeredgecolor':'black'})
<seaborn.axisgrid.FacetGrid at 0x24abda81dc0>
Yes, they are different.
### 2.4 Point plot to compare the conditional means of the numeric column GIVEN the non-numeric column
sns.catplot(data=df01, x='x', y='value', kind='point', linestyle='none')
<seaborn.axisgrid.FacetGrid at 0x24abed131c0>
Yes, they are different.
### 2.5 Violin plot to visualize conditional density of the numeric column GIVEN the non-numeric column
sns.catplot(data=df01, x='x', y='value', kind='violin')
<seaborn.axisgrid.FacetGrid at 0x24ac7693460>
Yes, they are different.
Create a CONDITIONAL KDE plot using Seaborn to show the conditional density of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
### 2.6 Conditional KDE plot to show conditional density of the numeric column GIVEN the non-numeric column, column associated with KDE color
sns.displot(data=df01, x='value', hue='x', kind='kde',common_norm=False)
<seaborn.axisgrid.FacetGrid at 0x24ac0181d60>
Yes, they are different.
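The sample-size effect mentioned in the hint can be illustrated with plain histograms, as an analogy to what `common_norm=False` does for the KDE. The sketch below is a minimal illustration with two hypothetical groups of very different sizes: normalizing each group on its own gives every group an area of 1, while normalizing by the total sample size shrinks the smaller group.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical groups with very different sample sizes.
small_group = rng.normal(loc=0.0, scale=1.0, size=100)
large_group = rng.normal(loc=2.0, scale=1.0, size=5000)

bin_edges = np.linspace(-5.0, 7.0, 41)
bin_width = np.diff(bin_edges)

# Per-group normalization: each group's density integrates to 1 on its own.
small_density, _ = np.histogram(small_group, bins=bin_edges, density=True)
large_density, _ = np.histogram(large_group, bins=bin_edges, density=True)

# Joint normalization: counts are divided by the TOTAL sample size.
total_n = small_group.size + large_group.size
small_joint = np.histogram(small_group, bins=bin_edges)[0] / (total_n * bin_width)

small_area = (small_density * bin_width).sum()
large_area = (large_density * bin_width).sum()
small_joint_area = (small_joint * bin_width).sum()

print(small_area, large_area, small_joint_area)
```

Under joint normalization the small group's area is roughly its share of the total sample, so its curve looks flat next to the large group even if the shapes are identical.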
Create a FACETED HISTOGRAM plot using Seaborn to show the conditional histogram of the numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the COLUMN FACETS. The x and y scales of the facets must be free or not-shared across the facets.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the FACETED HISTOGRAM?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
### 2.7 Facet histogram to show conditional histogram of numeric column GIVEN the non-numeric column, the non-numeric column as the facet, the x and y are not shared
sns.displot(data=df01, x='value', col='x', kind='hist',
aspect=0.75,
facet_kws={'sharex':False,'sharey':False})
<seaborn.axisgrid.FacetGrid at 0x24ac5e1f3a0>
Yes, they are different.
You have explored the CONDITIONAL DISTRIBUTIONS of the numeric column GIVEN the non-numeric column.
Which plot types made it easy to COMPARE summary statistics across the categories?
Which plot types made it easy to COMPARE the distributional SHAPE across the categories?
What do you think?
The box plot made it easy to compare summary statistics across the categories.
The faceted histogram made it easy to compare the distributional shape across the categories.
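The summary statistics that a box plot displays can also be tabulated per category as an optional cross-check. The sketch below is a minimal example on a hypothetical stand-in for `df01` (four balanced categories with made-up centers); the column names `x` and `value` mirror the assignment, but the data are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical stand-in for df01: four balanced categories with different centers.
demo = pd.DataFrame({
    "x": np.repeat(["A", "B", "C", "D"], 700),
    "value": np.concatenate([
        rng.normal(loc=center, scale=1.0, size=700)
        for center in (1.0, 3.0, 5.0, 7.0)
    ]),
})

# One row per category, one column per summary statistic.
summary = demo.groupby("x")["value"].agg(["count", "mean", "median", "std"])
print(summary)
```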
###
df02 = pd.read_csv('hw07_prob_02.csv')
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df02.shape)
df02.info()
(900, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 900 entries, 0 to 899 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 900 non-null float64 1 x2 900 non-null float64 2 m1 900 non-null object dtypes: float64(2), object(1) memory usage: 21.2+ KB
### 1.2 Check for missing values
df02.isnull().sum()
x1 0 x2 0 m1 0 dtype: int64
### 1.3 Check for unique values
df02.nunique()
x1 900 x2 900 m1 9 dtype: int64
### 1.4 Describe all columns
df02.describe(include='all')
| | x1 | x2 | m1 |
|---|---|---|---|
| count | 900.000000 | 900.000000 | 900 |
| unique | NaN | NaN | 9 |
| top | NaN | NaN | A |
| freq | NaN | NaN | 100 |
| mean | 0.019472 | 0.039430 | NaN |
| std | 1.038965 | 1.037111 | NaN |
| min | -3.435065 | -2.940527 | NaN |
| 25% | -0.705076 | -0.707391 | NaN |
| 50% | -0.020579 | 0.043096 | NaN |
| 75% | 0.728071 | 0.735318 | NaN |
| max | 3.068722 | 2.886683 | NaN |
### 2.1 Bar chart for non-numeric columns
sns.catplot(data=df02, x='m1', kind='count')
<seaborn.axisgrid.FacetGrid at 0x24ac7b71580>
Yes, they are balanced.
Create HISTOGRAMS using Seaborn to visualize the marginal distributions of the continuous variables in df02.
You may create separate figures for each histogram based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate histograms.
Do the marginal distributions appear symmetric?
### 2.2 Histogram for numeric column marginal distribution
sns.displot(data=df02, x='x1', kind='hist', bins=20)
<seaborn.axisgrid.FacetGrid at 0x24acb414d90>
sns.displot(data=df02, x='x2', kind='hist', bins=20)
<seaborn.axisgrid.FacetGrid at 0x24ace651520>
### 2.2.0 Melt the data frame for numeric columns
df02_lf = df02.reset_index().\
rename(columns={'index':'id'}).\
melt(id_vars=['id','m1'], value_vars=['x1','x2'])
### 2.2.1 Histogram for numeric column marginal distribution
sns.displot(data=df02_lf, x='value', kind='hist',col='variable',bins= 20)
<seaborn.axisgrid.FacetGrid at 0x24acb0d3d90>
I notice a slight difference in the x2 histogram between the values 2 and 3, which is due to the different binning scheme.
Both distributions are approximately symmetric.
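One plausible source of a different binning scheme is that a faceted histogram may compute its bin edges from the pooled values of all facets, whereas a separate figure uses only one variable's range (in Seaborn this is likely governed by the `common_bins` option, which is stated here as an assumption). The sketch below uses synthetic data to show how the same sample gets different bin counts under the two sets of edges.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-ins for the two numeric columns.
x1 = rng.normal(size=900)
x2 = rng.normal(size=900)

# Edges from x2 alone versus edges from the pooled x1 and x2 values.
own_edges = np.histogram_bin_edges(x2, bins=20)
pooled_edges = np.histogram_bin_edges(np.concatenate([x1, x2]), bins=20)

own_counts, _ = np.histogram(x2, bins=own_edges)
pooled_counts, _ = np.histogram(x2, bins=pooled_edges)

print("bin width (own):   ", own_edges[1] - own_edges[0])
print("bin width (pooled):", pooled_edges[1] - pooled_edges[0])
```

Both histograms contain the same 900 observations; only the bar boundaries move, which is enough to change the apparent height of individual bars.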
Create CONDITIONAL KDE plots using Seaborn to show the conditional densities of each numeric column GIVEN the non-numeric column. The non-numeric column must be associated with the KDE color.
You may create separate figures for each KDE plot based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate KDE plots.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the CONDITIONAL KDE plot?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
### 2.6 Conditional KDE plot to show conditional density of the numeric column GIVEN the non-numeric column, column associated with KDE color
sns.displot(data=df02_lf, x='value', hue='m1', kind='kde',col = 'variable' ,common_norm=False)
<seaborn.axisgrid.FacetGrid at 0x24ace6fac70>
No, the conditional distributions appear similar across the categories.
Create BOX PLOTS using Seaborn to visualize the summary statistics of the numeric columns GIVEN the non-numeric column.
You may create separate figures for each boxplot based on the WIDE FORMAT data OR reshape the data into LONG FORMAT and create separate FACETS for each variable. You CANNOT use for-loops to create the separate boxplots.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT according to the BOX PLOT?
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column
sns.catplot(data=df02_lf, x='m1', y='value', kind='box',col='variable')
<seaborn.axisgrid.FacetGrid at 0x24ace6499d0>
No, they are not apparently different.
### 2.4 Scatter plot to visualize the relationship between the numeric columns GIVEN the non-numeric column
sns.relplot(data=df02, x='x1', y='x2', kind='scatter')
<seaborn.axisgrid.FacetGrid at 0x24ad13fa760>
No, I cannot see any clear relationships between the two.
Let's now check if the continuous variable relationship depends on the non-numeric variable.
Create a scatter plot between the continuous variables using Seaborn. Color the markers based on the non-numeric column to study if the relationship CHANGES across the categories.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
### 2.4.1 Scatter plot to visualize the relationship between the numeric columns GIVEN the non-numeric column, with markers colored by the non-numeric column
sns.relplot(data=df02, x='x1', y='x2', kind='scatter', hue='m1')
<seaborn.axisgrid.FacetGrid at 0x24ad149da60>
Yes, they appear different.
Let's include a TREND line within the scatter plot to help visualize the linear relationship between the two continuous variables. Let's begin by IGNORING the potential influence of the non-numeric column.
Create a scatter plot which includes a trend line to show the linear relationship between the two numeric columns. You should NOT color based on the non-numeric column.
What kind of relationship does the TREND line represent when the non-numeric column is ignored?
### 2.5 Trend plot
sns.lmplot(data=df02, x='x1', y='x2')
<seaborn.axisgrid.FacetGrid at 0x24ace4e27c0>
The trend line shows that there is little relationship between the two numeric columns.
Let's now include TREND lines that are associated with the categories of the non-numeric column.
Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and the trend lines based on the non-numeric column.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
###
sns.lmplot(data=df02, x='x1', y='x2', hue='m1')
<seaborn.axisgrid.FacetGrid at 0x24ad14edb80>
Yes, the relationship between the two numeric columns differs across the categories.
Lastly, let's FACET by the non-numeric column!
Create a scatter plot which includes trend lines to show the linear relationship between the numeric columns. Color the markers and trend lines and FACET based on the non-numeric column. The color and facets are therefore associated with the SAME variable.
The facets should have 3 columns per row.
Does the CONDITIONAL RELATIONSHIP appear DIFFERENT across the CATEGORIES?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
### 2.7 FACET non-numeric column
sns.lmplot(data=df02, x='x1', y='x2', hue='m1', col='m1', col_wrap=3)
<seaborn.axisgrid.FacetGrid at 0x24ad62c8700>
Yes, the relationship between the two numeric columns differs across the categories.
You will continue working with the data from Problem 02 to explore the relationship between the two continuous variables.
Linear relationships can be summarized by calculating the correlation coefficient between the numeric columns. The correlation coefficients can be visualized as correlation plots via heat maps. However, let's first practice calculating the correlation matrix between the two numeric columns in df02.
Display the correlation matrix for the numeric columns in df02 to the screen. You do NOT need to assign the correlation matrix to an object.
### 3.1 Correlation matrix for numeric columns in df02
df02.corr(numeric_only=True)
| | x1 | x2 |
|---|---|---|
| x1 | 1.000000 | 0.021982 |
| x2 | 0.021982 | 1.000000 |
Let's now VISUALIZE the correlation plot as a heat map!
Create a correlation plot between the numeric columns in df02. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.
You must ignore the non-numeric column for this correlation plot.
### 3.2 Heatmap for correlation matrix
sns.heatmap(df02.corr(numeric_only=True),
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=True,annot_kws={'size':20},
cbar=False,
fmt='.2f')
<AxesSubplot: >
Let's now examine if the correlation plot CHANGES across the categories of the non-numeric column. However, let's practice calculating the grouped correlation matrix BEFORE visualizing the correlation plot.
Display the grouped correlation matrix for the numeric columns in df02 to the screen. You must group by the non-numeric column. You do NOT need to assign the correlation matrix to an object.
### 3.3 Grouped correlation matrix
df02.groupby('m1').corr(numeric_only=True)
| m1 | | x1 | x2 |
|---|---|---|---|
| A | x1 | 1.000000 | -0.991282 |
| | x2 | -0.991282 | 1.000000 |
| B | x1 | 1.000000 | -0.880486 |
| | x2 | -0.880486 | 1.000000 |
| C | x1 | 1.000000 | -0.722998 |
| | x2 | -0.722998 | 1.000000 |
| D | x1 | 1.000000 | -0.395593 |
| | x2 | -0.395593 | 1.000000 |
| E | x1 | 1.000000 | -0.059890 |
| | x2 | -0.059890 | 1.000000 |
| F | x1 | 1.000000 | 0.270515 |
| | x2 | 0.270515 | 1.000000 |
| G | x1 | 1.000000 | 0.785730 |
| | x2 | 0.785730 | 1.000000 |
| H | x1 | 1.000000 | 0.902762 |
| | x2 | 0.902762 | 1.000000 |
| I | x1 | 1.000000 | 0.992068 |
| | x2 | 0.992068 | 1.000000 |
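The contrast between a near-zero overall correlation and strong within-group correlations can be reproduced on synthetic data. The sketch below is a minimal illustration under assumed settings: the helper `make_group`, the labels `neg` and `pos`, and the correlation values of -0.95 and 0.95 are all hypothetical and chosen only to make the cancellation visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

def make_group(label, rho, n=500):
    """Draw n points from a bivariate normal with correlation rho (hypothetical helper)."""
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return pd.DataFrame({"x1": xy[:, 0], "x2": xy[:, 1], "m1": label})

demo = pd.concat(
    [make_group("neg", -0.95), make_group("pos", 0.95)],
    ignore_index=True,
)

# Pooled correlation ignores the grouping column entirely.
pooled_corr = demo["x1"].corr(demo["x2"])

# Per-group correlation between x1 and x2.
group_corr = demo.groupby("m1")[["x1", "x2"]].apply(
    lambda g: g["x1"].corr(g["x2"])
)

print("pooled:", round(pooled_corr, 3))
print(group_corr.round(3))
```

Opposite-signed groups cancel when pooled, which is why a single correlation coefficient can hide relationships that only appear after grouping.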
Let's now VISUALIZE the grouped correlation plot!
Create a grouped correlation plot between the numeric columns in df02. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint. The correlation plot must be annotated.
### 3.4 Grouped heatmap for correlation matrix
the_groups = df02.m1.unique().tolist()
corr_per_group = df02.groupby('m1').corr(numeric_only=True)
fig,axs = plt.subplots(len(the_groups),1,figsize=(5,50),sharex=True,sharey=True)
for ix in range(len(the_groups)):
sns.heatmap(corr_per_group.loc[the_groups[ix],:],
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=True,annot_kws={'size':10},
cbar=False,
fmt='.2f',
ax=axs[ix])
axs[ix].set_title('m1: %s' % the_groups[ix])
You have visualized the distributions and relationship between the continuous variables in df02 several ways. Let's conclude by working with a plot type that combines both aspects into a single graphic.
Create a PAIRS PLOT to show the marginal histograms and scatter plot between the numeric columns in df02. You must ignore the non-numeric column.
### 3.5 Pairplot for numeric columns in df02
sns.pairplot(data=df02, vars=['x1','x2'],
diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x24ad95a4520>
CONDITIONAL DISTRIBUTIONS and CONDITIONAL RELATIONSHIPS can be shown within a PAIRS PLOT. The non-numeric column can be associated with COLOR which creates separate colored CONDITIONAL DISTRIBUTIONS and separate colored MARKERS within the SCATTER PLOTS. You must COLOR the PAIRS PLOT by the non-numeric column.
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
### 3.6 Grouped pairplot for numeric columns in df02
sns.pairplot(data=df02, vars=['x1','x2'], hue='m1',
diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x24ad95a41f0>
You have visually explored the relationship between the numeric columns many different ways. You ignored the non-numeric column, as well as examined if the relationship CHANGED across the categories of the non-numeric column.
Which plot type did you feel was the easiest for identifying if the relationship changed across the categories of the non-numeric column?
What do you think?
The pair plot is the easiest for identifying whether the relationship changed across the categories of the non-numeric column. However, the heat map shows the correlation between the two numeric columns more clearly.
###
df04 = pd.read_csv('hw07_prob_04.csv')
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df04.shape)
df04.info()
(633, 13) <class 'pandas.core.frame.DataFrame'> RangeIndex: 633 entries, 0 to 632 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x01 633 non-null float64 1 x02 633 non-null float64 2 x03 633 non-null float64 3 x04 633 non-null float64 4 x05 633 non-null float64 5 x06 633 non-null float64 6 x07 633 non-null float64 7 x08 633 non-null float64 8 x09 633 non-null float64 9 x10 633 non-null float64 10 x11 633 non-null float64 11 x12 633 non-null float64 12 v 633 non-null object dtypes: float64(12), object(1) memory usage: 64.4+ KB
### 1.2 Check for missing values
df04.isnull().sum()
x01 0 x02 0 x03 0 x04 0 x05 0 x06 0 x07 0 x08 0 x09 0 x10 0 x11 0 x12 0 v 0 dtype: int64
### 1.3 Check for unique values
df04.nunique()
x01 633 x02 633 x03 633 x04 633 x05 633 x06 633 x07 633 x08 633 x09 633 x10 633 x11 633 x12 633 v 3 dtype: int64
### 1.4 Describe all columns
df04.describe(include='all')
| | x01 | x02 | x03 | x04 | x05 | x06 | x07 | x08 | x09 | x10 | x11 | x12 | v |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633.000000 | 633 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A1 |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 211 |
| mean | 0.017201 | -0.117542 | 0.028487 | -0.098241 | 0.053754 | -0.062829 | 0.078069 | -0.030272 | 0.028753 | -0.062405 | 0.025930 | -0.044316 | NaN |
| std | 0.985951 | 1.007731 | 3.752440 | 0.991107 | 0.984573 | 0.949131 | 1.022223 | 1.035076 | 3.897745 | 1.003886 | 0.998188 | 0.971035 | NaN |
| min | -2.962974 | -3.314671 | -7.369698 | -2.603420 | -3.378890 | -3.024685 | -2.985966 | -2.777091 | -7.422036 | -2.802904 | -3.056180 | -2.715199 | NaN |
| 25% | -0.645987 | -0.832902 | -3.563217 | -0.783992 | -0.623991 | -0.698531 | -0.600231 | -0.714685 | -3.907429 | -0.715740 | -0.663492 | -0.718663 | NaN |
| 50% | 0.010903 | -0.087405 | 0.040706 | -0.111189 | -0.000370 | -0.068309 | 0.052151 | -0.024770 | 0.046817 | -0.065431 | 0.036369 | -0.024997 | NaN |
| 75% | 0.661162 | 0.533091 | 3.743382 | 0.597446 | 0.714583 | 0.573939 | 0.807849 | 0.703130 | 3.955639 | 0.632654 | 0.674309 | 0.640892 | NaN |
| max | 2.805809 | 2.773446 | 7.370209 | 2.883285 | 2.948060 | 2.660166 | 2.950523 | 2.759512 | 6.958443 | 2.998344 | 2.975531 | 2.948998 | NaN |
### 2.1 Bar chart for non-numeric columns
sns.catplot(data=df04, x='v', kind='count')
<seaborn.axisgrid.FacetGrid at 0x24ad95a4b50>
Yes, they are balanced.
It is best to study the marginal distributions and then conditional distributions associated with continuous variables (numeric columns) BEFORE exploring the relationships between them. However, we will modify the typical EDA workflow for this problem. Let's jump to using the PAIRS PLOT which allows exploring distributions and relationships within a single graphic. We will revisit the distributions in more detail later.
Create a PAIRS PLOT associated with all numeric columns in df04 using Seaborn.
What does this specific PAIRS PLOT reveal about the variables and their relationships?
### 2.2 Pairplot for numeric columns in df04
sns.pairplot(data=df04, diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x24ac1a4df40>
This plot shows some complicated relationships between the variables, with multiple lines visible in the scatter plots.
Most of the continuous variables are symmetric and have a shape similar to a normal distribution.
Let's now examine if the non-numeric column impacts the continuous variables. Create a PAIRS PLOT for the numeric columns and COLOR based on the non-numeric column using Seaborn.
What does this specific grouped PAIRS PLOT reveal about the impact of the non-numeric column on the continuous variables?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
### 2.3 non-numeric column as the difference and colors.
sns.pairplot(data=df04, hue='v', diag_kws={'common_norm':False})
<seaborn.axisgrid.PairGrid at 0x24ae2f273a0>
Let's now summarize the linear relationships between numeric columns using a CORRELATION PLOT. You do NOT need to display the correlation matrix first this time. Instead, we will jump straight to visualizing the CORRELATION PLOT.
Create a correlation plot between the numeric columns in df04. The correlation plot must be created using Seaborn. You must use a DIVERGING color palette with the correct bounds and midpoint.
Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?
You must ignore the non-numeric column for this correlation plot.
### 2.4 Linear relationships between the numeric columns using correlation plot
# annot = True
sns.heatmap(df04.corr(numeric_only=True),
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=True,annot_kws={'size':10},
cbar=False,
fmt='.2f')
<AxesSubplot: >
# annot = False
sns.heatmap(df04.corr(numeric_only=True),
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=False,annot_kws={'size':20},
cbar=False,
fmt='.2f')
<AxesSubplot: >
I think it is fine not to annotate the correlation plot, since the color intensity is clear enough to show whether each correlation is positive or negative, as well as the degree of correlation between the variables.
I can draw the same conclusions without the annotated text.
Let's now group the correlation plot by the non-numeric column.
Create a grouped correlation plot between the numeric columns in df04. You must group by the non-numeric column. The separate categories on the non-numeric column must be associated with separate subplots. The subplot title must be specified correctly to make it clear which subplot is associated with which value of the non-numeric column. You must use a DIVERGING color palette with the correct bounds and midpoint.
Do you feel this correlation plot needs to be annotated? Try annotating the correlation plot and then NOT annotating it. Are you able to reach the same conclusions without the annotated text?
### 2.5 Linear relationships between the numeric columns using heatmap and grouped by the non-numeric column
the_groups = df04.v.unique().tolist()
corr_per_group = df04.groupby('v').corr(numeric_only=True)
fig,axs = plt.subplots(len(the_groups),1,figsize=(8,25),sharex=True,sharey=True)
for ix in range(len(the_groups)):
sns.heatmap(corr_per_group.loc[the_groups[ix],:],
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=True,annot_kws={'size':10},
cbar=False,
fmt='.2f',
ax=axs[ix])
axs[ix].set_title('v: %s' % the_groups[ix])
# without annot
the_groups = df04.v.unique().tolist()
corr_per_group = df04.groupby('v').corr(numeric_only=True)
fig,axs = plt.subplots(len(the_groups),1,figsize=(8,25),sharex=True,sharey=True)
for ix in range(len(the_groups)):
sns.heatmap(corr_per_group.loc[the_groups[ix],:],
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=False,annot_kws={'size':10},
cbar=False,
fmt='.2f',
ax=axs[ix])
axs[ix].set_title('v: %s' % the_groups[ix])
I think it is fine not to annotate the grouped correlation plot, since the color intensity is clear enough to show whether each correlation is positive or negative, as well as the degree of correlation between the variables.
I can draw the same conclusions without the annotated text.
Let's now return to explore the continuous variable distributions in depth for df04. You have seen that there are more than just a few continuous variables in this data set! It might seem like we need to perform a lot of tedious actions to explore all of the variables. But, you do NOT need to manually create all figures! You do NOT need to resort to for-loops either! Instead, the data can be RESHAPED from the current WIDE-FORMAT to LONG-FORMAT. This allows associating Seaborn's FACETS with the continuous variables!
First, display the number of rows and columns in df04 as a reminder.
###
print(df04.shape)
(633, 13)
Reshape the df04 WIDE-FORMAT DataFrame into LONG-FORMAT. The numeric columns of df04 MUST be "gathered up" or STACKED on top of each other. The non-numeric column must NOT be gathered up. You MUST include a column named rowid that corresponds to the row index. The rowid column must NOT be gathered up with the other numeric columns.
Assign the LONG-FORMAT data set to the lf04 object.
Display the .info() method for the LONG-FORMAT object to the screen.
###
df04_features = df04.select_dtypes('number').copy()
df04_objects = df04.select_dtypes('object').copy()
id_cols = ['rowid'] + df04_objects.columns.tolist()
lf04 = df04.reset_index().\
rename(columns={'index':'rowid'}).\
melt(id_vars=id_cols, value_vars=df04_features.columns)
lf04.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7596 entries, 0 to 7595 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 rowid 7596 non-null int64 1 v 7596 non-null object 2 variable 7596 non-null object 3 value 7596 non-null float64 dtypes: float64(1), int64(1), object(2) memory usage: 237.5+ KB
lf04
| | rowid | v | variable | value |
|---|---|---|---|---|
| 0 | 0 | A1 | x01 | 1.264427 |
| 1 | 1 | A1 | x01 | 1.192453 |
| 2 | 2 | A1 | x01 | 0.687623 |
| 3 | 3 | A1 | x01 | -0.440204 |
| 4 | 4 | A1 | x01 | -0.017212 |
| ... | ... | ... | ... | ... |
| 7591 | 628 | C3 | x12 | -0.633067 |
| 7592 | 629 | C3 | x12 | -1.049238 |
| 7593 | 630 | C3 | x12 | -0.912559 |
| 7594 | 631 | C3 | x12 | -0.627877 |
| 7595 | 632 | C3 | x12 | -0.748772 |
7596 rows × 4 columns
The long-format data has 7596 rows and 4 columns.
The number of rows is 633 * 12 = 7596, which is the number of numeric values in df04.
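The row-count arithmetic for a wide-to-long reshape can be checked on a tiny example. The sketch below builds a hypothetical wide table (5 rows, 3 numeric columns, 1 non-numeric column; all values made up) and melts it with the same pattern used above, so the long table should have 5 * 3 = 15 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical wide table: 5 rows, 3 numeric columns, 1 non-numeric column.
wide = pd.DataFrame(rng.normal(size=(5, 3)), columns=["x01", "x02", "x03"])
wide["v"] = ["A1", "A1", "B2", "C3", "C3"]

# Select numeric columns BEFORE adding rowid so rowid is not gathered up.
numeric_cols = wide.select_dtypes("number").columns.tolist()

long = (
    wide.reset_index()
        .rename(columns={"index": "rowid"})
        .melt(id_vars=["rowid", "v"], value_vars=numeric_cols)
)

expected_rows = wide.shape[0] * len(numeric_cols)
print(long.shape, expected_rows)
```

Each original row contributes one long-format row per numeric column, and the identifier columns `rowid` and `v` are repeated rather than stacked.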
You can now use the LONG-FORMAT data to visually explore the numeric columns in df04!
Visualize the marginal distributions for each numeric variable in df04 using the LONG-FORMAT lf04 object and Seaborn. You must associate the correct newly created "gathered" value column with the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must use 21 bins to create the histograms. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.
How would you describe the SHAPES of the continuous variable distributions?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the COLUMN FACET to a column in the data?
###
sns.displot(data=lf04, x='value', col='variable', kind='hist',
facet_kws={'sharex':False,'sharey':False},
col_wrap=4,height=2,aspect=1,
bins = 21)
<seaborn.axisgrid.FacetGrid at 0x24af20cf700>
Their shapes are mostly similar to a normal distribution, but some of them look like combinations of multiple normal distributions.
The lf04 LONG-FORMAT DataFrame has a separate column for the non-numeric column in df04. Thus, it was NOT "gathered" with the numeric columns. You can therefore use the non-numeric column as a GROUPING variable in the visualizations!
Visualize the CONDITIONAL KDE plots for each numeric variable in df04 within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the correct newly created "gathered" value column in the x axis argument. You must associate the column facets with the correct newly created "gathered" variable column. You must associate the "original" df04 non-numeric column with the CONDITIONAL KDE color. The figure should have 4 facets per row. The x and y scales of the facets must be free or not-shared across the facets.
Do the CONDITIONAL DISTRIBUTIONS appear DIFFERENT across the categories of the non-numeric column?
HINT: Which Seaborn argument allows you to ASSOCIATE or link the color to a column in the data?
HINT: What do you need to set to make sure the SAMPLE SIZE effect is removed?
###
sns.displot(data=lf04, x='value', col='variable', kind='kde',hue='v',
col_wrap=4,height=2,aspect=1,
facet_kws={'sharex':False,'sharey':False},
common_norm=False)
<seaborn.axisgrid.FacetGrid at 0x24a91013310>
The conditional distributions of x03 and x09 differ across the categories of the non-numeric column.
Although there are multiple conditional distribution plots we should use to fully explore the data, you will conclude this assignment with a BOXPLOT. You will create separate BOXPLOTS for each "original" numeric column within FACETS of a single figure. Each facet must be associated with one of the "original" numeric columns in df04. You must associate the "original" df04 non-numeric column with the x axis argument. You must associate the correct newly created "gathered" value column with the y axis argument. You must associate the column facets with the correct newly created "gathered" variable column.
Experiment with using shared x and y axis scales across the FACETS and NOT SHARING the x and y axis scales. Which approach seems best for this particular data set?
###
sns.catplot(data=lf04, x='v', y='value', col='variable',kind='box',
sharex=False,sharey=False,
col_wrap=4,height=2,aspect=1)
<seaborn.axisgrid.FacetGrid at 0x24ace65a850>
sns.catplot(data=lf04, x='v', y='value', col='variable',kind='box',
sharex=True,sharey=True,
col_wrap=4,height=2,aspect=1)
<seaborn.axisgrid.FacetGrid at 0x24acf7ba070>
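Not sharing the axis scales is better for this data set. x03 and x09 have a much larger spread (standard deviations near 3.8) than the other variables (near 1), so a shared y scale compresses the boxes in the other facets.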
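The choice between shared and free axis scales can be informed by comparing the spread of each variable in the long-format data. The sketch below is a minimal example on hypothetical long-format data in which one variable is given a much larger scale than the others; the names and scale values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
# Hypothetical long-format data: one wide-spread variable among narrow ones.
scales = {"x01": 1.0, "x02": 1.0, "x03": 3.8}
long_demo = pd.concat(
    [
        pd.DataFrame({"variable": name, "value": rng.normal(scale=s, size=600)})
        for name, s in scales.items()
    ],
    ignore_index=True,
)

# Spread of each variable: large differences argue for free (unshared) scales.
spread = long_demo.groupby("variable")["value"].agg(["std", "min", "max"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

When one variable's range is several times larger than the rest, a shared axis is set by that variable and the remaining facets occupy only a small part of their panels.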
You must download the data associated with one of the Final Projects from the Canvas site. Save the file(s) in the same directory as this Jupyter notebook. You may use the same project as the previous assignment OR switch to a different project.
Read in the data associated with one of the Final Projects. You previously visually explored MARGINAL behavior. You must now begin to visually explore relationships between variables in the Project data. However, you do NOT need to explore ALL relationships in this assignment.
You MUST create at least 6 plots which explore relationships between variables. Those plots can be categorical-to-categorical relationships (combinations), categorical-to-continuous relationships, and/or continuous-to-continuous relationships. The exact type of plots you should use depends on the project.
However, 2 of the plots MUST involve MORE than 2 variables.
Add as many cells as you feel are necessary.
df_input = pd.read_csv('trial_inputs.csv')
df_output = pd.read_csv('trial_outputs.csv')
df_output_max_cycle = df_output.groupby('trial_id').last().reset_index()
# concat input and output
df_trial = pd.concat([df_input,df_output_max_cycle[['cycle','y']]],axis=1)
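The concat above lines the two frames up by row position, and groupby(...).last() takes the last row of each trial in file order. Both steps work when the files are sorted by trial_id and cycle. If that ordering is not guaranteed, a key-based join is a possible alternative. The sketch below uses small made-up frames in place of trial_inputs.csv and trial_outputs.csv. All names ending in _demo and all values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the input and output files, deliberately unsorted
inputs_demo = pd.DataFrame({"trial_id": [3, 1, 2], "x1": [-1, 0, 1]})
outputs_demo = pd.DataFrame({
    "trial_id": [1, 1, 2, 2, 2, 3],
    "cycle":    [1, 2, 3, 1, 2, 1],
    "y":        [0.10, 0.12, 0.25, 0.20, 0.22, 0.30],
})

# Row with the largest cycle per trial, regardless of row order in the file
last_rows = outputs_demo.loc[outputs_demo.groupby("trial_id")["cycle"].idxmax()]

# Join on the key so each trial's inputs meet its own outputs
trial_demo = inputs_demo.merge(last_rows, on="trial_id", how="left",
                               validate="one_to_one")
print(trial_demo)
```

The validate argument makes the merge raise an error if a trial_id appears more than once on either side, which guards against silently duplicated rows.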
### 1.1 Basic info on number of rows and columns, names of columns and data types
print(df_trial.shape)
df_trial.info()
(240, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   trial_id  240 non-null    int64
 1   x1        240 non-null    int64
 2   x2        240 non-null    int64
 3   x3        240 non-null    int64
 4   x4        240 non-null    int64
 5   x5        240 non-null    int64
 6   x6        240 non-null    object
 7   cycle     240 non-null    int64
 8   y         240 non-null    float64
dtypes: float64(1), int64(7), object(1)
memory usage: 17.0+ KB
### 1.2 Check for missing values
df_trial.isnull().sum()
trial_id    0
x1          0
x2          0
x3          0
x4          0
x5          0
x6          0
cycle       0
y           0
dtype: int64
### 1.3 Check for unique values
df_trial.nunique()
trial_id    240
x1            3
x2            3
x3            2
x4            2
x5            3
x6            4
cycle        48
y           240
dtype: int64
### 1.4 Describe all columns
df_trial.describe(include='all')
| | trial_id | x1 | x2 | x3 | x4 | x5 | x6 | cycle | y |
|---|---|---|---|---|---|---|---|---|---|
| count | 240.00000 | 240.000000 | 240.000000 | 240.00000 | 240.00000 | 240.000000 | 240 | 240.000000 | 240.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | 4 | NaN | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | A | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | 60 | NaN | NaN |
| mean | 120.50000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | NaN | 72.629167 | 0.000267 |
| std | 69.42622 | 0.731823 | 0.731823 | 1.00209 | 1.00209 | 0.731823 | NaN | 30.387594 | 0.000040 |
| min | 1.00000 | -1.000000 | -1.000000 | -1.00000 | -1.00000 | -1.000000 | NaN | 7.000000 | 0.000169 |
| 25% | 60.75000 | -1.000000 | -1.000000 | -1.00000 | -1.00000 | -1.000000 | NaN | 47.000000 | 0.000236 |
| 50% | 120.50000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | NaN | 96.000000 | 0.000271 |
| 75% | 180.25000 | 1.000000 | 1.000000 | 1.00000 | 1.00000 | 1.000000 | NaN | 100.000000 | 0.000282 |
| max | 240.00000 | 1.000000 | 1.000000 | 1.00000 | 1.00000 | 1.000000 | NaN | 100.000000 | 0.000355 |
### 1.5 Reshape from wide to long format
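# Gather the coded inputs x1 to x5 into variable/value pairs;
# trial_id, cycle, y, and x6 stay as identifier columns
df_trial_lf = df_trial.reset_index().\
rename(columns={'index':'id'}).\
melt(id_vars=['id','trial_id','cycle','y','x6'], value_vars=['x1','x2','x3','x4','x5'])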
df_trial_lf
| | id | trial_id | cycle | y | x6 | variable | value |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 39 | 0.000263 | A | x1 | -1 |
| 1 | 1 | 2 | 52 | 0.000279 | A | x1 | 1 |
| 2 | 2 | 3 | 38 | 0.000266 | A | x1 | -1 |
| 3 | 3 | 4 | 50 | 0.000287 | A | x1 | 1 |
| 4 | 4 | 5 | 40 | 0.000268 | A | x1 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1195 | 235 | 236 | 100 | 0.000301 | D | x5 | 1 |
| 1196 | 236 | 237 | 83 | 0.000262 | D | x5 | 1 |
| 1197 | 237 | 238 | 100 | 0.000345 | D | x5 | 0 |
| 1198 | 238 | 239 | 100 | 0.000345 | D | x5 | 0 |
| 1199 | 239 | 240 | 81 | 0.000265 | D | x5 | 0 |
1200 rows × 7 columns
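The long format has 1200 rows because each of the 240 trials is repeated once for each of the 5 gathered inputs. The sketch below shows the same reshaping on a tiny made-up frame so the row bookkeeping is easy to follow. The frame wide_demo and its values are hypothetical.

```python
import pandas as pd

# Made-up wide frame: 2 trials, 1 identifier column besides trial_id, 2 inputs
wide_demo = pd.DataFrame({
    "trial_id": [1, 2],
    "x6": ["A", "B"],
    "x1": [-1, 1],
    "x2": [0, 1],
})

# id_vars are repeated for every gathered column;
# value_vars are stacked into the 'variable' and 'value' columns
long_demo = wide_demo.melt(id_vars=["trial_id", "x6"], value_vars=["x1", "x2"])
print(long_demo)
```

Here 2 trials times 2 gathered inputs gives 4 rows, and each trial's x6 label appears once per gathered input.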
### 2.1 Counts of the coded levels of inputs x1 to x5, faceted by variable
sns.displot(data=df_trial_lf, x='value', col='variable', kind='hist',
facet_kws={'sharex':False,'sharey':False},
col_wrap=3,height=3,aspect=1,
bins = 6)
### 2.2 Histogram for numeric column marginal distribution
sns.displot(data=df_trial, x='y', kind='hist', bins=20, height=4, aspect=2)
### 2.3 Boxplot to visualize the summary statistics of the numeric column GIVEN the non-numeric column
sns.catplot(data=df_trial_lf, x = 'x6', y='cycle', kind='box',
col='variable',col_wrap=3,height=3,aspect=1,
sharex=False,sharey=False,
showmeans=True,
meanprops={'marker':'o','markerfacecolor':'white','markeredgecolor':'black'})
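A point worth checking in the boxplot above: x6 and cycle are identifier columns in the melt, so every facet of variable receives the same (x6, cycle) pairs and the panels should look alike. One possible way to make each facet specific to its input is to map the gathered value column to hue. The sketch below uses made-up data, and the frames demo and demo_lf and their values are hypothetical stand-ins for df_trial and df_trial_lf.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
n = 120
# Made-up stand-in for df_trial with two coded inputs
demo = pd.DataFrame({
    "x1": rng.choice([-1, 0, 1], size=n),
    "x3": rng.choice([-1, 1], size=n),
    "x6": rng.choice(list("ABCD"), size=n),
    "cycle": rng.integers(7, 101, size=n),
})
demo_lf = demo.melt(id_vars=["x6", "cycle"], value_vars=["x1", "x3"])

# Without 'value', every facet summarizes the same (x6, cycle) pairs
medians = demo_lf.groupby(["variable", "x6"])["cycle"].median().unstack("variable")
facets_identical = bool((medians["x1"] == medians["x3"]).all())
print(facets_identical)

# Treat the coded levels as categories and map them to hue
demo_lf["value"] = demo_lf["value"].astype(str)
g = sns.catplot(data=demo_lf, x="x6", y="cycle", hue="value",
                hue_order=["-1", "0", "1"],
                col="variable", kind="box", height=3, aspect=1)
```

With hue in place, each facet compares cycle across x6 separately for every level of that facet's input, which also makes the figure involve more than 2 variables.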
### 2.4 Point plot to compare the conditional means of the numeric column GIVEN the non-numeric column
sns.catplot(data=df_trial_lf, x = 'x6', y='cycle', kind='point',linestyle='none',
col='variable',col_wrap=3,height=3,aspect=1)
### 2.5 Pairplot for numeric columns in df_trial
sns.pairplot(data=df_trial, vars=['trial_id','y','cycle'],
diag_kws={'common_norm':False})
### 2.6 Heatmap for correlation matrix
sns.heatmap(df_trial[['trial_id','cycle','y']].corr(numeric_only=True),
vmin=-1, vmax=1, center=0, cmap='coolwarm',
annot=True,annot_kws={'size':10},
cbar=False,
fmt='.2f')